Enable DSA CP/absorbed/THD paths with TileLang fused ops #3674
HollowMan6 wants to merge 6 commits into NVIDIA:main
Conversation
Pull request overview
This PR extends the experimental DSAttention (DSA) path to support context parallelism (allgather CP) and packed THD masking, adds an “absorbed MLA” integration path, and introduces TileLang-based fused kernels (indexer + sparse MLA) with fallbacks and expanded unit coverage.
Changes:
- Enable DSAttention CP allgather masking and packed THD (varlen) masking, including sparse-KL streaming for indexer loss.
- Integrate the absorbed-MLA tensor rewrite in MultiLatentAttention and route absorbed execution through DSAttention (with optional fused SparseMLA).
- Add TileLang fused kernels/interfaces for the indexer and sparse MLA, plus extensive new unit tests for CP/THD/absorbed parity and fused plumbing.
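The "fused kernels with fallbacks" wiring can be sketched as an import-guarded dispatch. This is illustrative only: `tilelang_kernels`, `tilelang_indexer_fwd`, and `indexer_fwd_ref` are hypothetical names, not the PR's actual symbols, and the reference path here is a plain Python top-k stand-in for the real indexer math.

```python
# Hypothetical sketch of the fused-kernel-with-fallback pattern; the module
# and function names below are illustrative, not the PR's real symbols.
try:
    from tilelang_kernels import tilelang_indexer_fwd  # fused TileLang path
    HAVE_TILELANG = True
except ImportError:
    HAVE_TILELANG = False

def indexer_fwd_ref(scores, k):
    """Reference (unfused) path: indices of the k largest scores, ascending."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def indexer_fwd(scores, k):
    """Dispatch to the fused kernel when available, else the reference path."""
    if HAVE_TILELANG:
        return tilelang_indexer_fwd(scores, k)
    return indexer_fwd_ref(scores, k)
```

The point of the pattern is that unit tests can exercise both branches and assert parity, which matches the "fused plumbing" coverage the test file adds.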
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit_tests/transformer/test_attention_variant_dsa.py | Adds coverage for CP layout helpers, packed THD varlen masking parity, fused indexer loop behavior, streaming sparse-KL, and absorbed parity. |
| megatron/core/transformer/transformer_config.py | Removes the config-time guard that previously disallowed DSA with context parallelism. |
| megatron/core/transformer/multi_latent_attention.py | Adds absorbed MLA tensor rewrite and passes position_ids/up_v_weight into DSA for CP + absorbed execution. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_sparse_mla_fwd.py | New TileLang sparse-MLA forward kernel + Python interface. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_sparse_mla_bwd.py | New TileLang sparse-MLA backward kernels + Python interface. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_indexer_fwd.py | New TileLang fused indexer forward kernel + logits “cleaning” kernel. |
| megatron/core/transformer/experimental_attention_variant/ops/tilelang_indexer_bwd.py | New TileLang fused indexer backward kernel + Python interface. |
| megatron/core/transformer/experimental_attention_variant/ops/sparse_mla.py | Autograd wrapper around TileLang sparse-MLA forward/backward. |
| megatron/core/transformer/experimental_attention_variant/ops/indexer.py | Autograd wrapper around TileLang indexer forward/backward and a helper for extracting top-k scores. |
| megatron/core/transformer/experimental_attention_variant/dsa.py | Core DSA updates: CP position/masking helpers, varlen bounds, fused top-k + streaming sparse-KL, scratch caching, absorbed sparse attention routing, and updated loss/masking plumbing. |
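The packed THD (varlen) masking that `dsa.py` and the tests exercise can be illustrated with a small stand-in: tokens attend causally only within their own packed sequence, with boundaries given by cumulative sequence lengths (the usual `cu_seqlens` convention of varlen attention APIs). This is a hedged sketch of the idea, not the PR's actual helper.

```python
# Illustrative sketch of packed THD (varlen) causal masking: several
# sequences are packed into one token dimension, and token q may attend
# to token k only if both lie in the same sequence and k <= q.
# `cu_seqlens` follows the common varlen convention: [0, len0, len0+len1, ...].
def thd_causal_mask(cu_seqlens):
    """Return a total x total boolean mask (True = may attend)."""
    total = cu_seqlens[-1]
    # seq_id[t] = index of the packed sequence that token t belongs to
    seq_id = [0] * total
    for s in range(len(cu_seqlens) - 1):
        for t in range(cu_seqlens[s], cu_seqlens[s + 1]):
            seq_id[t] = s
    return [
        [seq_id[q] == seq_id[k] and k <= q for k in range(total)]
        for q in range(total)
    ]
```

A parity test in the style of the new unit tests would build this dense mask and compare it against the masked output of the varlen path.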
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e0d5681007
This PR upgrades the DSA path end-to-end to support context parallel (allgather CP) with THD packing support, absorbed MLA integration, and fused TileLang kernels with safe fallbacks. Signed-off-by: Hollow Man <hollowman@opensuse.org>
We are changing our review process and marking all open, unlabeled PRs as draft. This change will go into effect once #3659 is merged. Moving forward, all PRs will be required to start as draft PRs. If you wish to get your PR merged, mark your PR as "Ready for review". Read more about the new process in submit.md.
What does this PR do?
This PR upgrades the DSA path end-to-end to support context parallel (allgather CP) with THD packing support, absorbed MLA integration, and fused TileLang kernels with safe fallbacks.
Needs to go in together with #3026.
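Why the allgather CP path needs explicit position ids can be sketched briefly: each context-parallel rank holds only a slice of the sequence, so causal masking must be computed against global positions rather than local indices. The helper below is a hedged illustration assuming a simple contiguous-chunk layout; Megatron's actual CP sharding (e.g. load-balanced chunking) may differ.

```python
# Hedged sketch: global position ids for a context-parallel rank under a
# contiguous-chunk allgather layout (an assumption, not Megatron's exact
# sharding). Causal masks are then built from these global positions.
def cp_local_positions(seq_len, cp_size, cp_rank):
    """Global position ids of the tokens held by `cp_rank`."""
    assert seq_len % cp_size == 0, "sequence must divide evenly across ranks"
    chunk = seq_len // cp_size
    return list(range(cp_rank * chunk, (cp_rank + 1) * chunk))
```

With these positions in hand, a rank can mask attention against the allgathered KV using `global_q_pos >= global_k_pos`, independent of where its chunk sits in the sequence.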
Contribution process
flowchart LR
  A[Pre-checks] --> B[PR Tests]
  subgraph Code Review/Approval
    C1[Expert Review] --> C2[Final Review]
  end
  B --> C1
  C2 --> D[Merge]
Pre-checks
Core 0.8)
Code review
The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.
For MRs into the `main` branch
Feel free to message or tag @mcore-oncall in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
(Step 1): Add PR label
Add the Expert Review label when your PR is ready for review.
(Step 2): Collect the expert reviewers' reviews
Final Review might get declined if these requirements are not fulfilled.
(Step 3): Final Review
Add the Final Review label.
(Optional Step 4): Cherry-pick into release branch
If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.
For MRs into the `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.
Merging your PR
Any member of core-adlr and core-nemo will be able to merge your PR.